Abstract— This paper presents a theoretical and empirical
analysis of Expected Sarsa, a variation on Sarsa, the classic on- 
policy temporal-difference method for model-free reinforcement 
learning. Expected Sarsa exploits knowledge about stochasticity 
in the behavior policy to perform updates with lower variance. 
Doing so allows for higher learning rates and thus faster learn- 
ing. In deterministic environments, Expected Sarsa’s updates 
have zero variance, enabling a learning rate of 1. We prove 
that Expected Sarsa converges under the same conditions as 
Sarsa and formulate specific hypotheses about when Expected 
Sarsa will outperform Sarsa and Q-learning. Experiments in 
multiple domains confirm these hypotheses and demonstrate 
that Expected Sarsa has significant advantages over these more 
commonly used methods. 


I. INTRODUCTION 


In reinforcement learning (RL) [1], [2], an agent seeks 
an optimal control policy for a sequential decision problem. 
Unlike in supervised learning, the agent never sees exam- 
ples of correct or incorrect behavior. Instead, it receives 
only positive and negative rewards for the actions it tries. 
Since many practical, real world problems (such as robot 
control, game playing, and system optimization) fall in this 
category, developing effective RL algorithms is important to 
the progress of artificial intelligence. 

When the sequential decision problem is modeled as a 
Markov decision process (MDP) [3], the agent’s policy can 
be represented as a mapping from each state it may encounter 
to a probability distribution over the available actions. In 
some cases, the agent can use its experience interacting with 
the environment to estimate a model of the MDP and then 
compute an optimal policy via off-line planning techniques 
such as dynamic programming [4]. 

When learning a model is not feasible, the agent can 
still learn an optimal policy using temporal-difference (TD) 
methods [5]. Each time the agent acts, the resulting feedback 
is used to update estimates of its action-value function, which 
predicts the long-term discounted reward it will receive if it 
takes a given action in a given state. Under certain conditions, 
TD methods are guaranteed to converge in the limit to the 
optimal action-value function, from which an optimal policy 
can easily be derived. 

In off-policy TD methods such as Q-learning [6], the
behavior policy, used to control the agent during learning, 
is different from the estimation policy, whose value is being 
learned. The advantage of this approach is that the agent can 
employ an exploratory behavior policy to ensure it gathers 
sufficiently diverse data while still learning how to behave 
once exploration is no longer necessary. However, an on-policy
approach, in which the behavior and estimation policies are
identical, also has important advantages. First, it has stronger
convergence guarantees when combined with function approximation,
since off-policy approaches can diverge in that case [7], [8], [9].
Second, it has a potential advantage over off-policy methods in its
on-line performance, since the estimation policy, which is iteratively
improved, is also the policy used to control the agent. By annealing
exploration over time, on-policy methods can discover the 
same policies in the limit as off-policy approaches. 


The classic on-policy TD method is Sarsa [10], [11], which 
is named for the five components employed in its update rule: 
the current state and action $s_t$ and $a_t$, the immediate reward
$r$, and the next state and action $s_{t+1}$ and $a_{t+1}$. The use of
$a_{t+1}$ introduces additional variance into the update when the
estimation policy is stochastic, as is typically the case for
on-policy methods like Sarsa. This additional variance can slow
convergence. For this reason, Sutton and Barto proposed,
in a little-noted exercise in their classic book [2, Exercise
6.10], a variation on Sarsa designed to reduce variance in
the updates. Instead of simply using $a_{t+1}$, this variation
computes an expectation over all actions available in $s_{t+1}$.
Though the resulting algorithm, which we call Expected 
Sarsa, may offer substantial advantages over Sarsa, it has 
never been systematically studied and is not widely used in 
practice. 


In this paper, we present a theoretical and empirical 
analysis of Expected Sarsa. On the theoretical side, we show 
that Expected Sarsa shares the same convergence guarantees 
as Sarsa and thus finds the optimal policy in the limit under 
certain conditions. We also show that Expected Sarsa has 
lower variance in its updates than Sarsa and demonstrate 
which factors contribute to this gap. 


On the empirical side, we compare the performance of 
Expected Sarsa with the performance of Sarsa and Q- 
learning. We formulate two hypotheses about the perfor- 
mance difference between Expected Sarsa and these two 
methods and confirm them using two benchmark problems: 
the cliff walking problem and the windy grid world problem. 
We also present results in additional domains verifying the 
advantages of Expected Sarsa in a broader setting. 


II. BACKGROUND 


The sequential decision problems addressed in RL are 
often formalized as MDPs, which can be described as 4- 
tuples (S, A, T, R) where 

• $S$ is the set of all states the agent can encounter,

• $A$ is the set of all actions available,

• $T(s, a, s') = P(s' | s, a)$ is the transition function, and

• $R(s, a, s') = E(r | s, a, s')$ is the reward function.

The goal of the agent is to find an optimal policy $\pi^* =
P(a|s)$, which maximizes the expected discounted return

$$E\Big\{ \sum_{k=0}^{\infty} \gamma^k r_{t+k} \Big\}, \tag{1}$$

where $\gamma$ is a discount factor with $0 < \gamma < 1$.

All TD algorithms are based on estimating value functions. 
The state-value function $V^\pi(s)$ gives the expected return
when the agent is in state $s$ and follows policy $\pi$. The
action-value function $Q^\pi(s, a)$ gives the expected return when the
agent takes action $a$ in state $s$ and follows policy $\pi$ thereafter.
These two functions are related through

$$V^\pi(s) = \sum_a \pi(s, a)\, Q^\pi(s, a). \tag{2}$$

TD methods seek the optimal action-value function
$Q^*(s, a)$, from which an optimal policy $\pi^*$ can easily be
deduced. $Q^*(s, a)$ can be found by iteratively updating the
estimate $Q(s, a)$.
The off-policy method Q-learning updates its Q values 
using the update rule 


$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \big[ r_t + \gamma \max_a Q(s_{t+1}, a) - Q(s_t, a_t) \big]. \tag{3}$$


The max operator causes the estimation policy to be greedy, 
which guarantees the Q values converge to Q*(s,a). The 
behavior policy of Q-learning is usually exploratory and 
based on Q(s, a). 

For Sarsa the behavior policy and the estimation policy 
are equal. The update rule of Sarsa is 


$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \big[ r_t + \gamma Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t) \big]. \tag{4}$$


Because Sarsa is on-policy, it will not converge to optimal Q 
values as long as exploration occurs. However, by annealing 
exploration over time, Sarsa can converge to optimal Q 
values, just like Q-learning. 
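
To make the difference between updates (3) and (4) concrete, the following minimal Python sketch applies both rules to the same transition. The table layout (one list of action values per state), the function names and the numbers are illustrative assumptions and are not taken from the experiments reported in this paper.

```python
def q_learning_update(Q, s, a, r, s_next, alpha, gamma):
    """Update (3): bootstrap from the largest action value in the next state."""
    target = r + gamma * max(Q[s_next])
    Q[s][a] += alpha * (target - Q[s][a])


def sarsa_update(Q, s, a, r, s_next, a_next, alpha, gamma):
    """Update (4): bootstrap from the action that is actually taken next."""
    target = r + gamma * Q[s_next][a_next]
    Q[s][a] += alpha * (target - Q[s][a])


# Hypothetical two-state, two-action table; Q[state][action].
Q_ql = [[0.0, 0.0], [1.0, 3.0]]
Q_sarsa = [[0.0, 0.0], [1.0, 3.0]]

# Same transition (s=0, a=0, r=1, s'=1); Sarsa happens to sample the
# non-greedy next action 0, so its target is smaller than Q-learning's.
q_learning_update(Q_ql, 0, 0, 1.0, 1, alpha=0.5, gamma=0.9)
sarsa_update(Q_sarsa, 0, 0, 1.0, 1, 0, alpha=0.5, gamma=0.9)
print(Q_ql[0][0], Q_sarsa[0][0])
```

The Sarsa result depends on which next action was sampled, whereas the Q-learning result does not; this dependence is the source of the additional variance discussed in Section III.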


III. EXPECTED SARSA 


Since Sarsa’s convergence guarantee requires that every 
state be visited infinitely often, the behavior and therefore 
also the estimation policy is typically stochastic so as to 
ensure sufficient exploration. As a result, there can be
substantial variance in Sarsa updates, since $a_{t+1}$ is not selected
deterministically.

Of course, variance can occur in updates for any TD
method because the environment can introduce stochasticity
through $T$ and $R$. Since TD methods are typically used when
a model of the environment is not available, there is little the
agent can do about this stochasticity except employ a suitably
low $\alpha$. However, the additional variance introduced by Sarsa
stems from the policy stochasticity, which is known to the
agent.

Expected Sarsa is a variation of Sarsa which exploits this 
knowledge to prevent stochasticity in the policy from further 
increasing variance. It does so by basing the update, not on 
$Q(s_{t+1}, a_{t+1})$, but on its expected value $E\{Q(s_{t+1}, a_{t+1})\}$.
The resulting update rule is:

$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \big[ r_t + \gamma \sum_a \pi(s_{t+1}, a)\, Q(s_{t+1}, a) - Q(s_t, a_t) \big]. \tag{5}$$

Using this expectation reduces the variance in the update,
as we show formally in Section V. Lower variance means
that in practice $\alpha$ can often be increased in order to speed
learning, as we demonstrate empirically in Section VII. In
fact, when the environment is deterministic, Expected Sarsa
can employ $\alpha = 1$, while Sarsa still requires $\alpha < 1$ to cope
with policy stochasticity.

Algorithm 1 shows the complete Expected Sarsa algorithm.
Because the update rule of Expected Sarsa, unlike that of
Sarsa, does not make use of the action taken in $s_{t+1}$,
action selection can occur after the update. Doing so can be
advantageous in problems containing states with returning
actions, i.e., $P(s_{t+1} = s_t) > 0$. When $s_{t+1} = s_t$, performing
an update of $Q(s_t, a_t)$ will also update $Q(s_{t+1}, a_t)$, yielding
a better estimate before action selection occurs.


Algorithm 1 Expected Sarsa

Initialize $Q(s, a)$ arbitrarily for all $s, a$
loop {over episodes}
    Initialize $s$
    repeat {for each step in the episode}
        choose $a$ from $s$ using policy $\pi$ derived from $Q$
        take action $a$, observe $r$ and $s'$
        $V_{s'} = \sum_a \pi(s', a)\, Q(s', a)$
        $Q(s, a) \leftarrow Q(s, a) + \alpha [r + \gamma V_{s'} - Q(s, a)]$
        $s \leftarrow s'$
    until $s$ is terminal
end loop
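
The steps of Algorithm 1 can be sketched in Python as follows. The environment is a hypothetical five-state chain introduced only for illustration (deterministic moves, reward -1 per step, terminal state at the right end); the $\epsilon$-greedy policy, the tie-breaking rule and all parameter values are likewise assumptions and do not correspond to any experiment in this paper.

```python
import random

N_STATES, GOAL = 5, 4        # hypothetical chain: states 0..4, state 4 is terminal
LEFT, RIGHT = 0, 1


def step(s, a):
    """Deterministic move along the chain; every step yields reward -1."""
    s_next = min(s + 1, GOAL) if a == RIGHT else max(s - 1, 0)
    return -1.0, s_next


def policy_probs(q_values, epsilon):
    """pi(s, .) for an epsilon-greedy policy; ties go to the first action."""
    n = len(q_values)
    greedy = max(range(n), key=lambda a: q_values[a])
    probs = [epsilon / n] * n
    probs[greedy] += 1.0 - epsilon
    return probs


def expected_sarsa(episodes, alpha, gamma, epsilon, seed=0):
    rng = random.Random(seed)
    Q = [[0.0, 0.0] for _ in range(N_STATES)]    # terminal row stays at 0
    for _ in range(episodes):
        s = 0
        while s != GOAL:
            # choose a from s using the policy derived from Q
            a = rng.choices([LEFT, RIGHT], weights=policy_probs(Q[s], epsilon))[0]
            r, s_next = step(s, a)
            # expected value of the next state under the current policy
            probs_next = policy_probs(Q[s_next], epsilon)
            v_next = sum(p * q for p, q in zip(probs_next, Q[s_next]))
            Q[s][a] += alpha * (r + gamma * v_next - Q[s][a])
            s = s_next
    return Q


# The chain is deterministic, so a learning rate of 1 is used here.
Q = expected_sarsa(episodes=2000, alpha=1.0, gamma=0.9, epsilon=0.1)
greedy_actions = [max((LEFT, RIGHT), key=lambda a: Q[s][a]) for s in range(GOAL)]
print(greedy_actions)
```

Note that the next action is never sampled before the update; only the policy probabilities in the next state are needed.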


Expected Sarsa can also be viewed, not as a lower-variance
version of Sarsa, but as an on-policy version of Q-learning.
Note the similarity between the expectation value
$E\{Q(s_{t+1}, a_{t+1})\}$ used by Expected Sarsa and (2) relating
$V^\pi(s)$ to $Q^\pi(s, a)$. Since $Q(s, a)$ is an estimate of $Q^\pi(s, a)$,
its expectation value can be seen as the estimate $V(s)$ for
$V^\pi(s)$ using the relation:

$$V(s) = \sum_a \pi(s, a)\, Q(s, a). \tag{6}$$

If the policy $\pi$ is greedy, $\pi(s, a) = 0$ for all $a$ except for the
action for which $Q$ has its maximal value. Therefore, in the
case of a greedy policy, (6) simplifies to

$$V(s) = \max_a Q(s, a). \tag{7}$$


Thus, Q-learning’s update rule (3) is just a special case 
of Expected Sarsa’s update rule (5) for the case when 
the estimation policy is greedy. Nonetheless, the complete 
Expected Sarsa algorithm is different from that of Q-learning 
because the former is on-policy and the latter is off-policy. 
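
The reduction from (6) to (7) can be checked numerically. The short Python sketch below evaluates (6) for a greedy policy and for an $\epsilon$-greedy policy on one hypothetical vector of action values; the helper names and the numbers are assumptions made for illustration.

```python
def epsilon_greedy_probs(q_values, epsilon):
    """pi(s, .): random action with probability epsilon, greedy otherwise."""
    n = len(q_values)
    best = max(range(n), key=lambda a: q_values[a])
    probs = [epsilon / n] * n
    probs[best] += 1.0 - epsilon
    return probs


def expected_value(q_values, probs):
    """Relation (6): V(s) = sum_a pi(s, a) Q(s, a)."""
    return sum(p * q for p, q in zip(probs, q_values))


q_next = [0.5, 2.0, -1.0]    # hypothetical Q(s', .) values

# epsilon = 0 gives the greedy policy, for which (6) collapses to (7).
v_greedy = expected_value(q_next, epsilon_greedy_probs(q_next, 0.0))
# With exploration, the expectation lies below the maximum.
v_soft = expected_value(q_next, epsilon_greedy_probs(q_next, 0.3))
print(v_greedy, v_soft)
```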


IV. CONVERGENCE 


In this section, we prove that Expected Sarsa converges 
to the optimal policy under some straightforward conditions 
given below. We make use of the following Lemma, which 
was also used to prove convergence of Sarsa [12]: 

Lemma 1: Consider a stochastic process $(\zeta_t, \Delta_t, F_t)$,
where $\zeta_t, \Delta_t, F_t : X \to \mathbb{R}$ satisfy the equations

$$\Delta_{t+1}(x_t) = (1 - \zeta_t(x_t))\, \Delta_t(x_t) + \zeta_t(x_t)\, F_t(x_t),$$

where $x_t \in X$ and $t = 0, 1, 2, \ldots$. Let $P_t$ be a sequence of
increasing $\sigma$-fields such that $\zeta_0$ and $\Delta_0$ are $P_0$-measurable
and $\zeta_t$, $\Delta_t$ and $F_{t-1}$ are $P_t$-measurable, $t \ge 1$. Assume that
the following hold:
1) the set $X$ is finite,
2) $\zeta_t(x_t) \in [0, 1]$, $\sum_t \zeta_t(x_t) = \infty$, $\sum_t (\zeta_t(x_t))^2 < \infty$
w.p.1 and $\forall x \ne x_t : \zeta_t(x) = 0$,
3) $\|E\{F_t | P_t\}\| \le \kappa \|\Delta_t\| + c_t$, where $\kappa \in [0, 1)$ and $c_t$
converges to zero w.p.1,
4) $\mathrm{Var}\{F_t(x_t) | P_t\} \le K (1 + \kappa \|\Delta_t\|)^2$, where $K$ is some
constant,

where $\|\cdot\|$ denotes a maximum norm. Then $\Delta_t$ converges
to zero with probability one.

The idea is to apply Lemma 1 with $X = S \times A$, $P_t = \{Q_0,
s_0, a_0, r_0, \alpha_0, s_1, a_1, \ldots, s_t, a_t\}$, $x_t = (s_t, a_t)$, $\zeta_t(x_t) =
\alpha_t(s_t, a_t)$ and $\Delta_t(x_t) = Q_t(s_t, a_t) - Q^*(s_t, a_t)$. If we can
then prove that $\Delta_t$ converges to zero with probability one,
we have convergence of the Q values to the optimal values.
The maximum norm specified in the lemma can then be
understood as satisfying the following equation:

$$\|\Delta_t\| = \max_s \max_a |Q_t(s, a) - Q^*(s, a)|. \tag{8}$$


Theorem 1: Expected Sarsa as defined by update (5) converges
to the optimal value function whenever the following
assumptions hold:

1) $S$ and $A$ are finite,

2) $\alpha_t(s_t, a_t) \in [0, 1]$, $\sum_t \alpha_t(s_t, a_t) = \infty$,
$\sum_t (\alpha_t(s_t, a_t))^2 < \infty$ w.p.1 and $\forall (s, a) \ne (s_t, a_t) : \alpha_t(s, a) = 0$,

3) The policy is greedy in the limit with infinite exploration,

4) The reward function is bounded.

Proof: To prove this theorem, we simply check that all 
the conditions of Lemma 1 are fulfilled. The first, second 
and fourth conditions of this lemma correspond to the first, 
second and fourth assumptions of the theorem. Below, we 
will show the third condition of the lemma holds. 


We can derive the value of $F_t$ as follows:

$$F_t = \frac{1}{\alpha_t} \big( \Delta_{t+1} - (1 - \alpha_t) \Delta_t \big) = r_t + \gamma \sum_a \pi_t(s_{t+1}, a)\, Q_t(s_{t+1}, a) - Q^*(s_t, a_t),$$

where all the values are taken over the state-action pair
$(s_t, a_t)$, except when specified differently.

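If we can show that $\|E\{F_t\}\| \le \kappa \|\Delta_t\| + c_t$, where
$\kappa \in [0, 1)$ and $c_t$ converges to zero, all the conditions of
the lemma can be fulfilled and we have convergence of $\Delta_t$
to zero and therefore convergence of $Q_t$ to $Q^*$. We derive
this as follows:

$$\begin{aligned}
\|E\{F_t\}\| &= \Big\| E\Big\{ r_t + \gamma \sum_a \pi_t(s_{t+1}, a)\, Q_t(s_{t+1}, a) - Q^*(s_t, a_t) \Big\} \Big\| \\
&\le \Big\| E\Big\{ r_t + \gamma \max_a Q_t(s_{t+1}, a) - Q^*(s_t, a_t) \Big\} \Big\| + \gamma \Big\| E\Big\{ \sum_a \pi_t(s_{t+1}, a)\, Q_t(s_{t+1}, a) - \max_a Q_t(s_{t+1}, a) \Big\} \Big\| \\
&\le \gamma \max_s \Big| \max_a Q_t(s, a) - \max_a Q^*(s, a) \Big| + \gamma \max_s \Big| \sum_a \pi_t(s, a)\, Q_t(s, a) - \max_a Q_t(s, a) \Big| \\
&\le \gamma \|\Delta_t\| + \gamma \max_s \Big| \sum_a \pi_t(s, a)\, Q_t(s, a) - \max_a Q_t(s, a) \Big|,
\end{aligned}$$
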
where the second inequality results from the definition of 
Q* and the fact that the maximal difference in value over 
all states is always at least as large as a difference between 
values corresponding to a state $s_{t+1}$. The third inequality
follows directly from (8). The other (in)equalities are based 
on algebraic rewriting or definitions. 

We identify $c_t = \gamma \max_s \big| \sum_a \pi_t(s, a)\, Q_t(s, a) -
\max_a Q_t(s, a) \big|$ and $\kappa = \gamma$. Clearly, $c_t$ converges to zero for
policies that are greedy in the limit. Therefore, if $\gamma < 1$, all
of the conditions of Lemma 1 follow from the assumptions
in the present theorem and we can apply the lemma to prove
convergence of $Q_t$ to $Q^*$. $\blacksquare$
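
As a numerical illustration of why $c_t$ vanishes for policies that become greedy, the Python sketch below evaluates the term $\gamma \max_s |\sum_a \pi(s, a) Q(s, a) - \max_a Q(s, a)|$ for an $\epsilon$-greedy policy with decreasing $\epsilon$. The use of $\epsilon$-greedy exploration as the greedy-in-the-limit policy, the Q table and the function name are assumptions made for this example only.

```python
def c_term(Q, epsilon, gamma):
    """gamma * max_s | sum_a pi(s,a) Q(s,a) - max_a Q(s,a) | for epsilon-greedy pi."""
    worst = 0.0
    for q_values in Q:
        best = max(q_values)
        mean = sum(q_values) / len(q_values)
        # An epsilon-greedy policy mixes the greedy value with the uniform mean.
        expected = (1.0 - epsilon) * best + epsilon * mean
        worst = max(worst, abs(expected - best))
    return gamma * worst


# Hypothetical table with three states and two actions each.
Q = [[1.0, 3.0], [0.0, -2.0], [5.0, 5.0]]
values = [c_term(Q, eps, gamma=0.9) for eps in (0.5, 0.1, 0.01, 0.0)]
print(values)
```

For this policy class the term is proportional to $\epsilon$, so annealing $\epsilon$ to zero drives it to zero, as condition 3 of Lemma 1 requires.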


V. VARIANCE ANALYSIS 


Section IV shows that Expected Sarsa converges to the
optimal policy under the same conditions as Sarsa. In this
section, we further analyze the behavior of the two methods
to show theoretically under what conditions Expected Sarsa
will in some sense perform better. Specifically, we show that
both algorithms have the same bias and that the variance of
Expected Sarsa is lower. Finally, we describe which factors
affect this difference in variance. In this section, we use
$v_t = r_t + \gamma \sum_a \pi_t(s_{t+1}, a)\, Q_t(s_{t+1}, a)$ and $\hat{v}_t = r_t +
\gamma Q_t(s_{t+1}, a_{t+1})$ to denote the target of Expected Sarsa and
Sarsa, respectively.

The bias of the updates of both algorithms under a certain
policy $\pi$ is given by the following equation:

$$\mathrm{Bias}(s, a) = Q^\pi(s, a) - E\{X_t\}, \tag{9}$$

where $X_t$ is either $v_t$ or $\hat{v}_t$. Both algorithms have the same
bias, since $E\{v_t\} = E\{\hat{v}_t\}$. The variance is then given by:

$$\mathrm{Var}(s, a) = E\{(X_t)^2\} - (E\{X_t\})^2. \tag{10}$$


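We first calculate this variance for Sarsa:

$$\mathrm{Var}(s, a) = \sum_{s'} T_{sa}^{s'} \Big( \gamma^2 \sum_{a'} \pi_{s'a'} (Q_t(s', a'))^2 + (R_{sa}^{s'})^2 + 2 \gamma R_{sa}^{s'} \sum_{a'} \pi_{s'a'} Q_t(s', a') \Big) - (E\{\hat{v}_t\})^2.$$

Similarly, for Expected Sarsa we get:

$$\mathrm{Var}(s, a) = \sum_{s'} T_{sa}^{s'} \Big( \gamma^2 \Big( \sum_{a'} \pi_{s'a'} Q_t(s', a') \Big)^2 + (R_{sa}^{s'})^2 + 2 \gamma R_{sa}^{s'} \sum_{a'} \pi_{s'a'} Q_t(s', a') \Big) - (E\{v_t\})^2.$$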


Since $E\{v_t\} = E\{\hat{v}_t\}$, the difference between the two
variances simplifies to the following:

$$\gamma^2 \sum_{s'} T_{sa}^{s'} \Big( \sum_{a'} \pi_{s'a'} (Q_t(s', a'))^2 - \Big( \sum_{a'} \pi_{s'a'} Q_t(s', a') \Big)^2 \Big).$$

The inner term is of the form:

$$\sum_i w_i x_i^2 - \Big( \sum_i w_i x_i \Big)^2, \tag{11}$$

where the $w$ and $x$ correspond to the $\pi$ and $Q$ values.
When $w_i > 0$ for all $i$ and $\sum_i w_i = 1$, we can give an
unbiased estimate of the variance of the weighted values $w_i x_i$
as follows:

$$\frac{\sum_i w_i (x_i - \bar{x})^2}{1 - \sum_i w_i^2}, \tag{12}$$

where $\bar{x}$ is the weighted mean $\sum_i w_i x_i$. Taking the numerator
of this fraction and rewriting this gives us:

$$\begin{aligned}
\sum_i w_i (x_i - \bar{x})^2 &= \sum_i w_i x_i^2 - 2 \sum_i w_i x_i \bar{x} + \sum_i w_i \bar{x}^2 \\
&= \sum_i w_i x_i^2 - 2 \bar{x}^2 + \bar{x}^2 \\
&= \sum_i w_i x_i^2 - \bar{x}^2,
\end{aligned}$$

which is exactly the same quantity as given in (11). This
shows that this quantity is closely related to the weighted
variance of the $w_i x_i$. Therefore, the more the $x_i$ deviate from
the weighted mean $\sum_i w_i x_i$, the larger this quantity will be.
In our context this occurs in settings where there is a large
difference between the Q values of different actions and there
is much exploration. In case of a greedy policy or when all
Q values have the same value, this quantity is 0.
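
The behavior of the quantity in (11) can be illustrated with a few hand-picked cases. In the Python sketch below, the weights play the role of the policy probabilities and the values play the role of the Q values of one next state; the function name and all numbers are assumptions chosen for illustration.

```python
def policy_variance_term(q_values, probs):
    """Quantity (11): sum_i w_i x_i^2 - (sum_i w_i x_i)^2."""
    mean = sum(p * q for p, q in zip(probs, q_values))
    second_moment = sum(p * q * q for p, q in zip(probs, q_values))
    return second_moment - mean * mean


# Greedy policy: all probability mass on one action, so the term is zero.
greedy = policy_variance_term([1.0, 3.0], [0.0, 1.0])
# Equal Q values: the term is zero regardless of the policy.
equal_q = policy_variance_term([2.0, 2.0], [0.5, 0.5])
# epsilon-greedy with epsilon = 0.1 and 0.3 on two actions.
low_explore = policy_variance_term([1.0, 3.0], [0.05, 0.95])
high_explore = policy_variance_term([1.0, 3.0], [0.15, 0.85])
# Same exploration as low_explore, but a larger spread in Q values.
wide_spread = policy_variance_term([1.0, 9.0], [0.05, 0.95])
print(greedy, equal_q, low_explore, high_explore, wide_spread)
```

The term grows with both the amount of exploration and the spread of the Q values, which are the two factors identified above.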


VI. HYPOTHESES 


In this section, we formulate specific hypotheses about 
when Expected Sarsa will outperform Q-learning and Sarsa. 
These hypotheses are based on the central differences be- 
tween Expected Sarsa and these two alternatives: 1) unlike Q- 
learning, Expected Sarsa is on-policy and 2) Expected Sarsa 
has lower variance than Sarsa. 


For simplicity, we restrict our attention to the case where
exploration is performed using $\epsilon$-soft behavior policies, i.e.,
the agent takes a random action with probability $\epsilon$ and uses
the estimation policy otherwise. Using such exploration,
off-policy methods can sometimes perform quite differently than
on-policy methods. For example, in the cliff-walking task
(detailed in Section VII), some actions can have disastrous
consequences in certain states, e.g., when near a cliff.
Off-policy methods try to estimate the optimal way to behave
without exploration and then merely employ an $\epsilon$-soft version
of the resulting policy. Consequently, they may never learn
to avoid such catastrophic actions. By contrast, on-policy
methods try to estimate the optimal way to behave given
the exploration that is occurring. Therefore, they can learn
policies that are qualitatively different from the optimal
policy without exploration but that avoid catastrophic actions
in the presence of exploration, e.g., by staying further away
from the cliff. Based on this difference we can define two
different types of problems:

1) Problems where the optimal $\epsilon$-soft policy is better than
the $\epsilon$-soft policy based on $Q^*(s, a)$.
2) Problems where the optimal $\epsilon$-soft policy is equal to
the $\epsilon$-soft policy based on $Q^*(s, a)$.

Because Expected Sarsa is on-policy and Q-learning is off-policy,
we state the following hypothesis:

Hypothesis 1: Expected Sarsa will outperform Q-learning 
for problems of Type 1. 

Section V demonstrated that the variance in the update 
target for Sarsa is larger than for Expected Sarsa, especially 
when the policy stochasticity is large and when there is a 
large spread in Q values of the actions of a state. Based 
on these facts, we can formulate a second hypothesis, one 
about the performance difference between Expected Sarsa 
and Sarsa. 

Hypothesis 2: Expected Sarsa will outperform Sarsa on 
problems of both Type 1 and Type 2. The size of the 
performance difference depends primarily on two factors: 


1) When environment stochasticity is high, performance 
difference will be small. 

2) When policy stochasticity is high, performance differ- 
ence will be large. 


VII. RESULTS AND DISCUSSION 


In this section we present a series of experiments to 
compare the online performance of Expected Sarsa to that 
of Sarsa and Q-learning in order to test the hypotheses 
described in the previous section. We start with the cliff 
walking problem. This is an example of a problem where 
an exploration policy based on the optimal action values 
Q*(s,a) is not equal to the optimal policy with exploration 
added. Sutton and Barto showed that Sarsa outperforms 
Q-learning on this problem [2]. We show that Expected 
Sarsa outperforms Q-learning as well as Sarsa, confirming 
Hypothesis 1 and providing some evidence for Hypothesis 
2. 


We then present results on two versions of the windy 
grid world problem, one with a deterministic environment 
and one with a stochastic environment. We do so in order 
to evaluate the influence of environment stochasticity on 
the performance difference between Expected Sarsa and 
Sarsa and confirm the first part of Hypothesis 2. We then 
present results for different amounts of policy stochasticity 
to confirm the second part of Hypothesis 2. For completeness, 
we also show the performance of Q-learning on this problem. 
Finally, we present results in other domains verifying the 
advantages of Expected Sarsa in a broader setting. All results 
presented below are averaged over numerous independent 
trials such that the standard error becomes negligible. 


A. Cliff Walking 


We begin by testing Hypothesis 1 using the cliff walking
task, an undiscounted, episodic navigation task in which the 
agent has to find its way from start to goal in a deterministic 
grid world. Along the edge of the grid world is a cliff (see 
Figure 1). The agent can take any of four movement actions: 
up, down, left and right, each of which moves the agent one 
square in the corresponding direction. Each step results in a 
reward of -1, except when the agent steps into the cliff area, 
which results in a reward of -100 and an immediate return 
to the start state. The episode ends upon reaching the goal 
state. 
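
The dynamics described above can be sketched as a small Python step function. The grid dimensions, the positions of start, goal and cliff, and the rule that a move off the grid leaves the agent in place are assumptions (they follow the common layout in Sutton and Barto's book [2]); the text of this section does not specify them.

```python
ROWS, COLS = 4, 12                     # assumed grid size
START, GOAL = (3, 0), (3, 11)          # assumed positions (row, column)
MOVES = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}


def in_cliff(pos):
    """Assumed cliff: the bottom-row cells strictly between start and goal."""
    row, col = pos
    return row == ROWS - 1 and 0 < col < COLS - 1


def step(pos, action):
    """Return (reward, next_position, done) for one move."""
    d_row, d_col = MOVES[action]
    row = min(max(pos[0] + d_row, 0), ROWS - 1)   # assumed: stay inside the grid
    col = min(max(pos[1] + d_col, 0), COLS - 1)
    nxt = (row, col)
    if in_cliff(nxt):
        return -100.0, START, False    # penalty and immediate return to start
    return -1.0, nxt, nxt == GOAL      # the episode ends only at the goal


print(step(START, "right"))            # stepping into the cliff
print(step((2, 11), "down"))           # reaching the goal
```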


Fig. 1. The cliff walking task. The agent has to move from the start [S] 
to the goal [G], while avoiding stepping into the cliff (grey area). 


We evaluated the performance over the first n episodes as
a function of the learning rate $\alpha$ using an $\epsilon$-greedy policy
with $\epsilon = 0.1$. Figure 2 shows the result for n = 100 and
n = 100,000. We averaged the results over 50,000 runs and
10 runs, respectively.

Discussion. Expected Sarsa outperforms Q-learning and
Sarsa for all learning rate values, confirming Hypothesis 1
and providing some evidence for Hypothesis 2. The optimal
$\alpha$ value of Expected Sarsa for n = 100 is 1, while for
Sarsa it is lower, as expected for a deterministic problem.
That the optimal value of Q-learning is also lower than 1 is
surprising, since Q-learning also has no stochasticity in its
updates in a deterministic environment. Our explanation is
that Q-learning first learns policies that are sub-optimal in
the greedy sense, i.e., walking towards the goal with a detour
further from the cliff. Q-learning iteratively optimizes these
early policies, resulting in a path more closely along the cliff.
However, although this path is better in the off-line sense, in
terms of on-line performance it is worse. A large value of
$\alpha$ ensures the goal is reached quickly, but a value somewhat
lower than 1 ensures that the agent does not try to walk right
on the edge of the cliff immediately, resulting in a slightly
better on-line performance.

For n = 100,000, the average return is equal for all
$\alpha$ values in case of Expected Sarsa and Q-learning. This
indicates that the algorithms have converged long before the
end of the run for all $\alpha$ values, since we do not see any
effect of the initial learning phase. For Sarsa the performance
comes close to the performance of Expected Sarsa only for
$\alpha = 0.1$, while for large $\alpha$, the performance for n = 100,000
even drops below the performance for n = 100. The reason
is that for large values of $\alpha$ the Q values of Sarsa diverge.
Although the policy is still improved over the initial random
policy during the early stages of learning, divergence causes
the policy to get worse in the long run.


[Plot: average return (vertical axis) against alpha (horizontal axis, 0.1 to 1) for Sarsa, Q-learning and Expected Sarsa, each for n = 100 and n = 1E5.]

Fig. 2. Average return on the cliff walking task over the first n episodes
for n = 100 and n = 100,000 using an $\epsilon$-greedy policy with $\epsilon = 0.1$. The
big dots indicate the maximal values.


B. Windy Grid World 


We turn to the windy grid world task to further test Hy- 
pothesis 2. The windy grid world task is another navigation 
task, where the agent has to find its way from start to goal. 
The grid has a height of 7 and a width of 10 squares. There 
is a wind blowing in the 'up' direction in the middle part of
the grid, with a strength of 1 or 2 depending on the column.
Figure 3 shows the grid world with a number below each
column indicating the wind strength. Again, the agent can
choose between four movement actions: up, down, left and
right, each resulting in a reward of -1. The result of an action
is a movement of 1 square in the corresponding direction plus
an additional movement in the 'up' direction, corresponding
with the wind strength. For example, when the agent is in
the square right of the goal and takes a 'left' action, it ends
up in the square just above the goal.
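
A possible Python sketch of these dynamics is given below. The grid size of 7 by 10 is taken from the text; the per-column wind strengths, the start and goal positions, the clipping at the grid border and the choice to apply the wind of the column the agent leaves are assumptions (they follow the standard layout in Sutton and Barto's book [2] and are consistent with the example in the preceding paragraph).

```python
ROWS, COLS = 7, 10
WIND = [0, 0, 0, 1, 1, 1, 2, 2, 1, 0]   # assumed wind strength per column
START, GOAL = (3, 0), (3, 7)            # assumed positions (row, column)
MOVES = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}


def step(pos, action):
    """Return (reward, next_position, done); row 0 is the top of the grid."""
    d_row, d_col = MOVES[action]
    # The wind of the departure column pushes the agent 'up' (toward row 0).
    row = pos[0] + d_row - WIND[pos[1]]
    col = pos[1] + d_col
    row = min(max(row, 0), ROWS - 1)    # assumed: moves are clipped at the border
    col = min(max(col, 0), COLS - 1)
    nxt = (row, col)
    return -1.0, nxt, nxt == GOAL


# The example from the text: 'left' from the square right of the goal
# ends in the square just above the goal.
print(step((3, 8), "left"))
```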

1) Deterministic Environment: We first consider a deterministic
environment. As in the cliff walking task, we
use an $\epsilon$-greedy policy with $\epsilon = 0.1$. Figure 4 shows the
performance as a function of the learning rate $\alpha$ over the
first n episodes for n = 100 and n = 100,000. For n = 100
the results are averaged over 10,000 independent runs, for
n = 100,000 over 10 independent runs.

Fig. 3. The windy grid world task. The agent has to move from start [S]
to goal [G]. The numbers under the grid indicate the wind strength in the
column above.

Discussion. For the deterministic windy grid world task
the performance of Q-learning and Expected Sarsa is essentially
equal. The fact that for n = 100,000 the average return
is equal indicates that the behavior policies of Expected Sarsa
and Q-learning are equal in the limit for this task, i.e., the
optimal policy among the $\epsilon$-greedy policies (Expected Sarsa)
is equal to the policy that is $\epsilon$-greedy with respect to $Q^*(s, a)$
(Q-learning). The optimal $\alpha$ is 1 for Expected Sarsa as well
as Q-learning. Sarsa again has a lower optimal $\alpha$. As in
the cliff walking task, we observed divergence of Q values
for high $\alpha$ values in the case of Sarsa. The performance
difference for n = 100 between Expected Sarsa and Sarsa at
their optimal values is (-45.0) - (-58.3) = 13.3 in favor
of Expected Sarsa.

Fig. 4. Average return on the windy grid world task over the first n episodes
for n = 100 and n = 100,000 and an $\epsilon$-greedy policy with $\epsilon = 0.1$ in a
deterministic environment. The big dots indicate maximal values.


2) Environment Stochasticity: We also consider a stochas- 
tic variation of the windy grid world problem and compare 
results to the performance difference in the deterministic case 
in order to evaluate the first part of Hypothesis 2. We added 
stochasticity to the environment by moving the agent with 
a probability of 20% in a random direction instead of the 
direction corresponding to the action. The performance as a
function of the learning rate is presented in Figure 5 for
n = 100 and n = 100,000. Again, we averaged the results
over 10,000 runs and 10 runs, respectively.
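
The environment stochasticity described above can be sketched as a thin wrapper around action execution. In the Python sketch below, the function name is hypothetical and the random direction is assumed to be drawn uniformly from all four directions, including the intended one; the text does not state how the random direction is chosen.

```python
import random

ACTIONS = ("up", "down", "left", "right")


def noisy_action(action, rng, slip_prob=0.2):
    """With probability slip_prob, execute a uniformly random direction instead."""
    if rng.random() < slip_prob:
        return rng.choice(ACTIONS)      # assumed: may coincide with the intended action
    return action


rng = random.Random(0)
executed = [noisy_action("up", rng) for _ in range(10)]
print(executed)
```

Under this assumption the executed direction differs from the intended one with probability 0.2 x 3/4 = 0.15, since the random draw can reproduce the intended action.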

Discussion. As expected, the optimal $\alpha$ for Expected Sarsa
and Q-learning in case of n = 100 drops considerably in
comparison to the deterministic case, to a value of 0.6. The
optimal $\alpha$ value of Sarsa also decreases, to 0.55. From the
n = 100,000 case, we can see that the policy no longer
converges for Expected Sarsa and Q-learning for all $\alpha$ values.
Although not stable for high $\alpha$ values, the average policy
is better for Expected Sarsa than for Q-learning, which is
likely due to the on-policy nature of Expected Sarsa. On the
other hand, for n = 100, Q-learning slightly outperforms
Expected Sarsa because it benefits more from optimistic
initialization, i.e., initially overestimating the Q values to
increase exploration during early learning. Since Q-learning
uses the maximal Q value of the next state in its update, it
takes longer for the Q values to decrease.

The performance difference for n = 100 between Expected
Sarsa and Sarsa at their optimal values is (-93.7) -
(-98.3) = 4.6 in favor of Expected Sarsa. The performance
difference is less than half that of the deterministic case,
confirming the first part of Hypothesis 2.

Fig. 5. Average return on the windy grid world task over the first n episodes
for n = 100 and n = 100,000 using an $\epsilon$-greedy policy with $\epsilon = 0.1$ in a
stochastic environment. The big dots indicate maximal values.


3) Policy Stochasticity: To confirm the second part of
Hypothesis 2, we repeat the stochastic windy grid world
experiment but with higher policy stochasticity, using an $\epsilon$
of 0.3 instead of 0.1. Figure 6 shows the results.

Discussion. For n = 100 the optimal $\alpha$ for Sarsa drops
from 0.55 to 0.45 and the optimal $\alpha$ for Q-learning decreases
slightly, though for Expected Sarsa it stays the same.
Furthermore, the performance difference between Q-learning
and Expected Sarsa increases. The performance difference
between Sarsa and Expected Sarsa also increases for n = 100
and is now (-121.0) - (-136.4) = 15.4, confirming the
second part of Hypothesis 2. Other experiments, not shown
in this paper, confirmed that the opposite is also true: when
policy stochasticity is low, i.e., using an $\epsilon$-greedy policy with
$\epsilon = 0.01$, there is practically no performance difference
between Sarsa and Expected Sarsa.


[Plot: average return (vertical axis) against alpha (horizontal axis, 0.1 to 0.9) for Sarsa, Q-learning and Expected Sarsa, each for n = 100 and n = 1E5.]

Fig. 6. Average return on the windy grid world task over the first n episodes
for n = 100 and n = 100,000 using an $\epsilon$-greedy policy with $\epsilon = 0.3$ in a
stochastic environment. The big dots indicate maximal values.


C. Other Domains 


To demonstrate that the advantage of Expected Sarsa holds 
more generally, we also tested in other domains. 

1) Maze: We compared Expected Sarsa to Sarsa and Q- 
learning on the maze problem shown in Figure 7. The goal 
of the agent is to find a path from start to goal, while 
avoiding hitting the walls. The reward for arriving at the 
goal is 100. When the agent bumps into a wall or border of 
the environment it stays at the same position, but receives a 
reward of -2. For all other steps a reward of -0.1 is received. 
The environment is stochastic and moves the agent with 
a probability of 10% in a random direction instead of the 
direction corresponding to the action. The discount factor $\gamma$
is set to 0.997. A trial is finished after the agent reaches the
goal or 10,000 actions have been performed. An $\epsilon$-greedy
behavior policy is used with $\epsilon = 0.05$ and we initialized the
Q values to 0.

We optimized $\alpha$ for each method such that the average
reward over the first 2 x 10° timesteps is maximized. The 
optimal values were 0.24, 0.28 and 0.27 for Sarsa, Q-learning 
and Expected Sarsa, respectively. We then plotted the reward
as a function of the number of timesteps for these optimal $\alpha$
values to get a more detailed look at performance. Figure 8
shows the results, which are averaged over 100 trials.

Discussion. Although Expected Sarsa and Q-learning per- 
form equally, Sarsa’s performance is lower and not mono- 
tonically increasing. It shows a drop in performance after 
0.2 x 10° timesteps, before it slowly increases again. This 
drop occurs in all one hundred runs. 

Although this is a clear demonstration of the possibility
that Sarsa can be unstable in certain cases, we have not
observed this phenomenon in previous research, and it is
remarkable because the value function is represented in a
table, without the complications of function approximation.
We explain this temporary performance drop of Sarsa as
follows: since in our implementation we initialized all Q
values to 0, while their real value is higher, all values
start to increase in the beginning. However, the values of
the best actions increase faster because they have a shorter
propagation path to the final reward of 100. Therefore,
initially Sarsa learns well. However, because of the high
discount factor of 0.997, all action values in a state start
to get very close to each other. This makes it possible that
after a bad exploration step, some values are updated in a
way that makes the policy worse. After a while Sarsa finds
a policy that is not optimal, but that is robust against such
value updates. The same drop in performance also happens
when using a learning rate of 0.04 for Sarsa, although initial
learning was slower and the drop occurred later.
The update targets of Expected Sarsa and Q-learning are
not affected by the action selected in the next state and are
therefore more robust towards performance drops.

Fig. 7. The maze problem. The starting position is indicated by [S] and
the goal position is indicated by [G].

Fig. 8. [...] problem. The results are averaged over 100 runs.

2) Cart Pole: As a final comparison, we test the on-line
performance of Expected Sarsa, Sarsa and Q-learning
on a cart-pole task. The goal was to balance a 1 m long
pole, weighing 0.1 kg, on a cart that weighs 1.0 kg. The
possible actions were all integer amounts between -10 and
10 Newton, where positive and negative forces correspond to
pushing the cart right and left, respectively. An action was
performed every 0.02 s. If the cart was pushed further than
2.4 m from the center of the track or if the pole dropped further
than 12 degrees to either side, the algorithm would receive
a -1 reward and the cart would be reset to the center with
the pole at a random angle between -3 and 3 degrees. A
neural network with 15 sigmoidal hidden units was used to
approximate the Q values. The input vector consisted of the
position and velocity of the cart and the angle and angular
velocity of the pole, all normalized to [-1, 1]. The value of $\epsilon$
was 0.05 and $\gamma$ was 0.95. Figure 9 shows the average reward
during learning at optimized $\alpha$ values of 0.12, 0.16 and 0.16
for Sarsa, Q-learning and Expected Sarsa, respectively.

Discussion. We see again that Expected Sarsa and Q-learning
perform similarly, while Sarsa is less stable and shows
lower performance. This demonstrates that the results extend
to the case of function approximation.

Fig. 9. The learning performance of the different methods on the cart pole.
The results are averaged over 200 simulations.


VIII. CONCLUSION 


In this paper we examined Expected Sarsa, a variation on 
the Sarsa algorithm intended to decrease the variance in the 
update rule, and compared it to the Sarsa and the Q-learning 
algorithm. 


We proved that Expected Sarsa converges under the same 
conditions as Sarsa. We also proved that the variance in the 
update rule of Expected Sarsa is smaller than the variance 
for Sarsa and that the difference in variance is largest when 
there is a high amount of exploration and a large spread 
in Q values of the actions of a specific state. Based on 
this theoretical analysis, we hypothesized that the on-line 
performance of Expected Sarsa will be higher than for Sarsa 
and that the difference in performance will be relatively large 
when there is a lot of policy exploration and small when the 
environment is very stochastic. We also formulated a second 
hypothesis based on the on-policy nature of Expected Sarsa 
that states that Expected Sarsa will outperform Q-learning for 
problems where an $\epsilon$-soft behavior policy based on $Q^*(s, a)$
is not equal to the optimal $\epsilon$-soft policy. We confirmed these
hypotheses using experiments on the cliff walking task and 
the windy grid world task. Finally, we presented results on 
two additional problems to verify the advantages of Expected 
Sarsa in a broader setting. 